An Adaptive, Semi-Structured Language Model Approach to Spam Filtering on a New Corpus

Author

Ben Medlock

Abstract

Motivated by current efforts to construct more realistic spam filtering experimental corpora, we present a newly assembled, publicly available corpus of genuine and unsolicited (spam) email, dubbed GenSpam. We also propose an adaptive model for semi-structured document classification based on language model component interpolation. We compare this with a number of alternative classification models, and report promising results on the spam filtering task using a specifically assembled test set to be released as part of the GenSpam corpus.

Similar resources

An Adaptive Approach to Spam Filtering on a New Corpus

Motivated by the absence of rigorous experimentation in the area of spam filtering using realistic email data, we present a newly-assembled corpus of genuine and unsolicited (spam) email, dubbed GenSpam, to be made publicly available. We also propose an adaptive model for semi-structured document classification based on smoothed n-gram language modelling and interpolation, and report promising ...

A Language Model Approach to Spam Filtering

We present a classification model for semi-structured documents based on statistical language modelling theory which outperforms extant approaches to spam filtering on the LingSpam email corpus [1]. We also introduce two variants of a novel discounting technique for higher-order N-gram language models developed in the light of the spam filtering problem.

Investigating classification for natural language processing tasks

This thesis investigates the application of classification techniques to four natural language processing (NLP) tasks. The classification paradigm falls within the family of statistical and machine learning (ML) methods and consists of a framework within which a mechanical ‘learner’ induces a functional mapping between elements drawn from a particular sample space and a set of designated target...

Single-Pass, Adaptive Natural Language Filtering: Measuring Value in User Generated Comments on Large-Scale, Social Media News Forums

There is considerable insight and social discovery potential in mining crowd-sourced comments left on popular news forums like Reddit.com, Tumblr.com, Facebook.com and Hacker News. Unfortunately, due to the overwhelming amount of participation with its varying quality of commentary, extracting value out of such data is not always obvious or timely. By designing efficient, single-pass and adap...

AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER

An efficient anti-spam filter that would block all unsolicited messages, i.e. spam, without blocking any legitimate messages is a growing need. To address this problem, this report takes a statistically-based approach, employing a Bayesian anti-spam filter, because it is content-based and self-learning (adaptive) in nature. We train the filter, using a large corpus of legitimate messages and spa...

Publication date: 2006
